From:                     Robert Eberl

Sent:                      Wednesday, June 07, 2000 10:28 PM

To:                         Research-Redmond (ALL)

Subject:                 Slow Network problems

Hello,

We are receiving many reports from Researchers that their network performance has been terrible for the last few days.  This email is to help explain what is happening.

The apparently "slow" response to various efforts is caused by a Redmond Domain problem. The dev teams are involved, and people in ITG are doing everything they can to correct the situation.

 

MSRSupp is aware of these issues but has no ability to change, correct, or circumvent the environment. If you think the problem you are experiencing matches the description below, please hang in there with the rest of us until a solution is implemented.  If you feel your issue is outside these events, please email us (msrsupp) for assistance.

 

Thank you for your patience. Here is the issue as reported on http://gnsweb/asddt/RedmondIssues.htm:

 

REDMOND Domain issues, 6/6 19:00

ASDDT and NTDev are still investigating chronic resource depletion issues on the REDMOND PDC FSMO holder. The server runs out of available heap to perform DS operations, even though more than 1GB of physical memory is apparently available. The problem affects any server acting as the PDC, so moving the role to another machine will not alleviate it. Once a server is in this state, we cannot even successfully transfer the role. We must reboot the affected server or take it offline and manually seize the role to another DC. If we take that step, it is only a matter of time until the problem recurs. For the past few days NTDev had been looking at \\red-dc-02, which we captured in this state last week when it was the PDC, and until late today they had not reached a conclusion about the root cause of the problem or what might solve it.

This morning the server holding the PDC role, \\red-dc-00, started suffering from this problem. Affected users contacted us directly with various types of authentication issues, so we began investigating. An attempt to transfer the role to \\red-dc-01 was unsuccessful, so we had to reboot \\red-dc-00 and seize the role back to that server. When this was done, the server failed to take possession of the WINS 1B (Domain Master Browser) record for an unknown reason (possibly a side effect of the failed role transfer). Even though the PDC was back online, a missing 1B record can cause downlevel clients and applications to fail authentication or be unable to log on. We manually updated the record and initiated a push trigger to speed up replication. We then attempted to cycle the Workstation and Server services on \\red-dc-00 so that it would grab the 1B record, which resulted in Net Logon becoming wedged. We rebooted the server again, which returned the PDC to normal function.

We reiterated the effects of the problem to NTDev and the urgency of finding a resolution. They found that lsass.exe (core to authentication and DS operation) was running out of virtual address space. We did some digging into why. At NTDev's suggestion, we had earlier enabled the large virtual address space for the OS with the "/3gb" option to solve a previous "out of version storage" issue that was preventing changes from being applied to the AD. With that option enabled, the DS is much more aggressive in how it uses the memory cache. Looking at the file header for lsass.exe, we found that it was not linked to be 3GB aware and so cannot use this functionality. With the "/3gb" switch and this built-in deficiency in lsass.exe, the OS effectively takes away 1GB of address space that LSA could have used, LSA runs itself out of VM, and we see the problems begin.
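
For anyone who wants to repeat the header check described above on another binary, here is a minimal sketch in Python. It is not the tool ASDDT or NTDev used. It assumes the standard PE/COFF layout, in which the "3GB aware" setting is the IMAGE_FILE_LARGE_ADDRESS_AWARE bit (0x0020) of the Characteristics field in the COFF file header. The function names and the synthetic test image are hypothetical and exist only for illustration.

```python
import struct

# Characteristics bit in the COFF file header marking an image as "3GB aware".
IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020


def is_large_address_aware(data: bytes) -> bool:
    """Return True if the PE image bytes have the large-address-aware bit set."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C in the DOS header) is the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 20-byte COFF header follows the 4-byte signature;
    # Characteristics is the 16-bit field at offset 18 within it.
    (characteristics,) = struct.unpack_from("<H", data, pe_offset + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)


def fake_image(characteristics: int) -> bytes:
    """Build a minimal synthetic header (hypothetical, for demonstration only)."""
    pe_offset = 0x80
    image = bytearray(pe_offset + 24)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\0\0"
    struct.pack_into("<H", image, pe_offset + 4 + 18, characteristics)
    return bytes(image)


if __name__ == "__main__":
    print(is_large_address_aware(fake_image(0x0122)))  # bit 0x0020 set
    print(is_large_address_aware(fake_image(0x0102)))  # bit 0x0020 clear
```

To check a real file, read its bytes (for example with open(path, "rb").read(), where path is whatever binary you want to inspect) and pass them to is_large_address_aware.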

At the suggestion of Dev, we have re-linked lsass.exe on \\red-dc-06 to enable the 3GB functionality. We have engaged ISM to bring \\red-dc-02 back online, remove the "/3gb" switch, and move the PDC role back to that server. ASDDT and NTDev will then compare the memory usage on the two servers and determine whether this change to lsass.exe will solve the memory issues. We have also asked them to bring \\red-dc-00 offline for unrelated issues.

We continue to be engaged with NTDev and ISM to get this problem solved and will work with them to provide status, but we may continue to see service-affecting issues. We will post status and descriptions of the issues we are working on to Opsweb under the "ASDDT" section. If new problems occur, please escalate them as appropriate and help us keep others informed before taking action, so we can make sure we are all working together.